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Abstract. The performance of principal component analysis (PCA) suffers badly in the 
presence of outliers. This paper proposes two novel approaches for robust PCA based on 
semidefinite programming. The first method, maximum mean absolute deviation rounding 
(MDR), seeks directions of large spread in the data while damping the effect of outliers. The 
second method produces a low-leverage decomposition (LLD) of the data that attempts to 
form a low-rank model for the data by separating out corrupted observations. This paper 
also presents efficient computational methods for solving these SDPs. Numerical experiments 
confirm the value of these new techniques. 



1. Introduction 

Principal component analysis (PCA), proposed in 1933 by Hotelling 23!, is a common tech- 
nique for summarizing high-dimensional data. Principal components are designed to identify 
directions in which the observations vary most. As a consequence, PCA is often used to reduce 
the dimension of the data. 

Statistics based on variance, such as principal components, are highly sensitive to outliers |43j . 
The literature on robust statistics contains a wide variety of techniques that attempt to correct 
this shortcoming 25]. Unfortunately, many of these approaches are based on intractable opti- 
mization problems or lack a principled foundation. 

Our focus in this work is to develop new formulations for robust PCA that can be solved 
efficiently using convex programming algorithms. Our first proposal, which we call maximum 
mean absolute deviation rounding (MDR), exchanges the variance in the definition of PCA 
with a function less sensitive to outliers known as the mean absolute deviation. Although this 
formulation leads to a non-convex optimization problem, we demonstrate that it is possible to 
approximate the optimum by relaxing to a semidefinite program and randomly rounding the 
solution. This method can be viewed as a specific instance of projection-pursuit PCA |26j . 

Our second proposal uses a different semidefinite program to split the input data into the 
sum of a low-leverage matrix and a matrix of corrupted observations. We refer to this dissection 
as a low-leverage decomposition (LLD) of the data. This method is similar in spirit to the 
rank-sparsity decomposition of Chandrasekaran et al. 7J- While preparing this manuscript, we 
learned of an independent investigation into this formulation of robust PCA by Xu et. al. [46ll47] . 

We describe algorithms that solve these semidefinite programs efficiently, and we provide 
numerical experiments that confirm the effectiveness of these new techniques. We begin with a 
brief overview of our proposals before laying out the details in Sections [2] and [3] 

1.1. The Data Model. Suppose that we have a family {xi}2^^ of n observations in p dimen- 
sions. We form an n x p data matrix X whose rows are the observations. The observations are 
assumed to be centered; that is, ^ Xi w 0. While our methods do not explicitly require the 
data to be centered, this hypothesis allows us to interpret principal components as directions of 
high variance in the data. We discuss practical centering approaches in Section [5j 
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1.2. Maximizing the Mean Absolute Deviation. Our first method is designed to mitigate 
a source of sensitivity in classical principal component analysis. The top principal component 
vpcA is defined as a direction of maximum variance in the data: 

^^PCA = argmax V \{x,,v)f . (1.1) 

lklU=i 



The squared inner products in (1.1) may lead to outsized influence of outlying points because 
squaring a large number results in a huge number, which can drag the principal component away 
from the bulk of the data. We can reduce this effect by replacing the squared inner product with 
a measure of spread that is less sensitive. We propose the use of the absolute value of the inner 
product: 

En 
\{xi,v)\, (1.2) 

where we have added the subscript MD to indicate that we have exchanged the variance in 



equation (1.1 1 with a measure of spread known as the mean absolute deviation (MD) |251 p. 2]. 



This revision results in some complications. The formulation (1.1) is an eigenvector problem 
which can be solved efficiently. In contrast, it is NP-hard to compute fMD- Nevertheless, 
we develop an efficient randomized algorithm that provably computes an approximate solution 



to (1.2). We call this approach maximum mean absolute deviation rounding (MDR). 

Our main result. Theorem |2.2[ states that, for any failure probability 5 > and loss factor 
e > 0, our algorithm produces a unit-norm vector i^mdr such that 

^ — / 2 ^ — \n 

> . |(a;,,t>MDR)| > a/-(1-£) max > . \{x„v)\. 

The algorithm requires that we solve one semidefinite program (SDP) whose size is polynomial 
in the number of observations. Since SDPs are solvable in polynomial time using interior-point 
methods, our algorithm is tractable in principle. In practice, solving SDPs can be daunting even 
for moderately sized input data — say, more than 100 observations. To address this issue, we 
detail a technique of Burer and Monteiro [H |S] that can usually solve the SDP efficiently, and 
in Section |5] we provide some numerical evidence that this approach succeeds. 

We find additional components by greedily restricting the data to a subspace perpendicular 



to the previous components and solving (1.2) again. 

This proposal is not without precedent. A more general formulation appears in Ruber's 
book [Ml p. 203], and it is now known as projection-pursuit PC A (PP-PCA) [33]. We provide 



further detail on PP-PCA in Section 2.2 and discuss the history of the method in 4.1 



1.3. A Low-Leverage Decomposition. Our second proposal stems from a different interpre- 
tation of classical principal component analysis. Instead of viewing classical principal compo- 
nents as directions of maximum variance, we can view them as an optimal low-rank model for 
the data [5]. Suppose is a matrix that solves 

minimize \\X — P||p 
subject to rank(P) = T. 

The dominant principal components of X. are given by the T right singular vectors of P^ 
corresponding with the nonzero singular values of P^ . 

With real data, one is often faced with the situation where entire observations are corrupted. 
If this is the case, we would still like to recover a low-rank model. We can develop as natural 
formulation for identifying a low-rank model using the well-known rank sparsity [15 and group 
sparsity [37!; heuristics. We propose to decompose the data matrix as X = Plld + C'lld by 
solving the semidefinite program 

minimize E» o-i(-P) + 7 II2 o\ 

subject io P + C^X. ^ ' 

We have written <Ji{P) for the ith singular value of P and Ci for the ith row of C. 

We view the optimal matrix Plld as a surrogate for the low-rank approximation to the 
uncorrupted data, and the optimal matrix Clld as an approximation of the corrupted data. 



The formulation (1.3) has an interesting property even when Plld is not low-rank or Clld is 
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not row-sparse: -Plld is guaranteed to be a low-leverage set of observations in a sense we make 
precise in Section [3.1| As a result, we refer to X = -Plld + C'lld as a low-leverage decomposition 
(LLD) of the data. We define the dominant LLD components as the right singular vectors of 
-Plld- 

This optimization problem is similar to the rank-sparsity decomposition problem proposed 
in [7]; see also [5]. We discuss these ideas at more length in Section |4] As this manuscript was 
being prepared, we learned of an independent investigation of the program (1.3) for robust PCA 
by Xu et. al. |461 147j that provides conditions for recovery of the support of the corruption and 
the row-space of the uncorrupted observations. 

1.4. Road map. Sections [2] and [3] describe our proposals in more detail, including theoretical 
guarantees and practical algorithms. Section [4] offers an overview of previous work on robust 
PCA, while Section [5] describes numerical experiments illustrating the performance of our meth- 
ods in various settings. A technical appendix contains the proofs of supporting results. 

1.5. Notation. We work exclusively with real numbers. The symbols P and E denote probability 
and expectation, respectively. We use d to denote the subgradient map. 

Bold capital letters denote matrices while bold lower-case letters denote vectors. We represent 
the ith row of a matrix A by a,; and the jth entry of a vector a by Oj. The adjoint of a matrix 
A is written A* . When referring to matrix elements, we sometimes use the notation [AJ^j, and 
similarly for vectors we use [a\i. 

We use the compact convention for the singular value decomposition (SVD) of a matrix: 
when A is rank r, we write its SVD as A = IfSV* , where U and V have orthonormal columns, 
and S is a non-singular diagonal matrix whose entries are positive and are arranged in weakly 
decreasing order. The notation A B denotes that A — B is positive semidefinite. 

1.5.1. Norms. We denote the £p vector norm as 11^11^ = l^iD^^^ 

= maxi \ui\. The Frobenius norm of a matrix is defined by ||A|jp 
represents the standard inner product 
denoted A^ . 

We define the £p to £q operator norm and its dual respectively by 



for I < p < oo and 

= {A, A), where (-, •) 
The Moore-Penrose pseudoinverse of a matrix A is 



\A\ 



sup 

■"ll„ = l 



\Au\\ 



and 



\B\ 



sup 

IIAII,,^ =1 



(B,A) 



2.3 



and 



2.4 



Table [T] describes some of the specific operator norms used in this work. We also use the norms 
||A.||2_j.i and ||A||^_j.-^, which lack such simple descriptions; see Sections 

The operator norm of the adjoint satisfies WA" 
conjugacy relations 1/p + 1/p* = 1 and 1/q + 1/q 



— \\A\\p^^ where p and q satisfy the 
1 with the convention l/oo 



0. 



Table 1. Summary of the norms used in this work. 



Norm 


Description 


Description of Dual 


ll^ll2->2 


Maximum singular value of A 


Sum of the singular values of A 


ll^ll2^oo 


Maximum £2 row norm of A 


Sum of the £2 row norms of A 


ll^lll^oo 


Maximum absolute entry of A 


Sum of the absolute entries of A 



2. Maximum Mean Absolute Deviation Rounding 

Our first method is based on the classical interpretation of the top principal component as the 
direction of maximum empirical variance in multidimensional data. It has long been recognized 
that the variance is highly sensitive to outliers in the data [33]. The field of robust statistics 
has reacted by developing and analyzing robust measures of spread known as robust scales; see 
[221 Ch. 5] or [3ni Sec. 2.5]. This literature describes a generic method for determining robust 
principal components by replacing the variance with a robust measure of scale. Li and Chen [55] 
published the first investigation of this under the name projection-pursuit PCA (PP-PCA) . Our 
proposal is a specific instance of PP-PCA with the mean absolute deviation scale (2.1 ). We show 
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that this formulation is computationally intractable, but we develop an algorithm that provably 
approximates its solution. To our knowledge, this is the first rigorous algorithm for PP-PCA 
with a robust scale. 

2.1. Scales. A scale is a function that measures the spread of one-dimensional data [251 Ch. 5]. 
By far, the most common scale is the empirical standard deviation, definecj^ as 

1 /2 

std(y) = yf ) =||y||2, 

where we we assume the data y is centered. Of course, the standard deviation is not the only 
way to measure the spread of the data. An alternative proposal |25l p. 2] is the mean absolute 
deviation (MD). For centered data y, the MD scale is defined as 

MD{y) = Y.\y^,\ = \\y\\v (2-1) 

i 

More generally, a scale is a function S : M" — >■ M such that S{ay) = |Q!|S'(y). Scales are 
typically chosen so that they are less sensitive to outliers than the standard deviation. The 
robust statistics literature focuses on scales that have a positive breakdown point: the value 
of the scale cannot be arbitrarily corrupted by nefariously chosen observations, so long as the 
fraction of bad observations in the entire data set is small. Although the mean absolute deviation 
has a breakdown point of zero, it exhibits more efficient behavior than the standard deviation 
under contaminated distributions |43) . 

2.1.1. Scales for multivariate data. We extend the definition of scales to multivariate data by 
considering the scale of the data in a given direction. The projection of the rows of X onto 
the unit direction u is given by the product Xu. Note that if X is centered in the sense of 
Section |1.H then the projection Xu is also centered by linearity. We define the scale of X in 
the direction u to be the scale of the projected data S{Xu). 

As noted in |24| . this definition is equivariant under an orthogonal change of basis: for any Q 
with Q*Q = 1, the scale of X in the direction u is equal to the scale of XQ* in the direction 
Qu. 

2.2. Projection-Pursuit PCA. Classically, the top principal component is defined as the 
direction where the empirical standard deviation in the data is largest: 

■wpcA = argmax std(Xi;). (2-2) 



A natural approach for finding robust components is to replace the standard deviation in (2.2) 



with a robust scale >S'(-), so that the robust component is the direction of maximum robust scale 

vpp — aigma,xS{Xv). 

Ihll2 = l 

We define further robust components inductively by adding orthogonality constraints: 

Vpp — argmax S{Xv). (2.3) 

II u|| ^ — 1 
v±v^Jl Vj<k 

This greedy method of constructing orthogonal components based on robust scales goes by the 
name projection-pursuit PCA. This scheme was originally proposed by Huber '24J p. 203], but 
was first studied in detail by Li and Chen [26.. PP-PCA reduces to PCA when the scale is given 
by the standard deviation due to the variational characterization of eigenvectors by Courant and 
Fischer. 

To implement the PP-PCA method, one only needs a method that finds the first component. 
We discuss how to enforce the orthogonality constraints in Section [2.6.11 



^One usually defines scales so that they are unbiased estimates of the sample standard deviation when the 
data is drawn from a normal distribution. We are more interested in the direction of maximal scale rather than 
the value, so we can safely ignore the normalization factor. 
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2.3. PP-PCA with the MD Scale is NP-Hard. Finding the top principal component is an 
eigenvector problem that amounts to computing the direction where the norm ||-||2_>.2 is achieved. 
Similarly, PP-PCA with the MD scale amounts to finding a vector that achieves an operator 
norm. Indeed, the problem vmd = arg max||„||_^^;^ ll-^'^lli is equivalent to the problem 

find ||fMD|l2 = 1 such that ||Xt;MD|li = ||-'^|l2-j.i- (2-4) 

Unfortunately, exchanging the £2 norm for the £1 norm leads to an NP-hard computational 
problem. To see this, we require the following result, which we establish in the Appendix. 

Fact 2.1. For each matrix X, the identity WXW^^i = ||^_^-^ holds. 

Rohn shows that there exists a class of well-conditioned positive matrices A4 such that 
the existence of a polynomial-time algorithm for accurately computing || A4'||^_j.-^ for all M E Ai 
implies P = NP. Since we can factor positive matrices M = RR* in polynomial time using, for 
example, a Cholesky factorization, the existence of an accurate polynomial-time algorithm that 
computes ||i?||2_j.i for any matrix R implies that P = NP. 



The observation that Equation (2.3) is NP-hard to solve for the specific choice S{-) ~ \\-\\^ has 
serious implications for existing PP-PCA algorithms. The algorithms available in the literature 
for PP-PCA HU 121] are general schemes that claim to work for any choice of scale 5*. As a 
result, none of these algorithms can provide both accurate and efhcient solutions to the PP-PCA 
problem. This issue is not merely theoretical because these algorithms tend to perform poorly 
in practice. We discuss this point further in Section [4T| 

2.4. Approximating the £2 £1 Norm using Randomized Rounding. Although it is 
NP-hard to compute the £2 £1 norm, it is possible to approximate its value efficiently. This 
fact is a consequence of the little Grothendieck theorem [321 Sec. 5b], but the algorithm depends 
on ideas of Nesterov |33], a technique of Burer and Monteiro [HIS], and a new factorization step. 

2.4.1. The semidefinite relaxation of the £2 — > £1 norm. Before describing our algorithm, we 
begin by showing how the computation of £2 — > £1 operator norm can be relaxed to a semidef- 
inite program. First, apply Fact |2.1| to change the computation of the £2 — > £1 norm to the 
computation of the £ao £1 norm: 

= - max y*XX*y. (2.5) 

\\y\\ —1 
II y II 



The second identity above follows from the proof of Fact 2.1 see also |3SJ Prop. 1]. Interpreting 



the quadratic form on the right hand side of (2.5) as a trace implies that is the optimal 

value of the (non-convex) program 

maximize trace(XX* Z) 

^ ' (2.6) 

subject to Z = yy*, [Z]u — 1 for all i. 

Relaxing the rank one constraint Z = yy* to a positive-semidefinite constraint Z )^ leads to 
the SDP 

maximize traA^eiXX* Z) 

^ ' (2.7) 

subject to Z )p 0, [Z]ii ~ 1 for all i. 



It follows that ||X||2_j.]^ < a*, where is the optimal value of (2.7). Moreover, Grothendieck's 
inequality for positive-semidefinite matrices implies that 

al<\^XX*\\^^,, (2.8) 

where this inequality is asymptotically the best possible [H Sec. 4.2]. Thus, is within a factor 
of ^7r/2 < 1.26 of the true value of the norm ]]X]|2_j.]^. 
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Algorithm 1: Maximum Mean Absolute Deviation Rounding 

Input: An n x p matrix X; repetition count K. 

Output: A p x 1 unit-norm vector v^, and an optimal value a*. 

(1) Find an i?^ such that = -R*-R* solves the semidefinite program 

maximize traceiZXX*) 

(2-10) 

subject to Z )^ 0, [Z]ii — 1 for i — 1, . . . ,n 

Set a* to be the square root of the optimal value: a^, — y^tTace{Zi,XX*). 

(2) For each fc = 1, . . . , if, do 

(a) Set y^''^ = sgn(i?*g^'^)), where g*^*^) is an n x 1 standard normal random 
vector. 

(b) Set vC^) = X*y('^-)/||^*y''^''||2- 

(3) Set v^, = argmaxj.^;^ || A?;''^^ || ^. 



2.5. The MDR Algorithm. The fact that equation (2.7 1 gives us a good upper bound on 
the value of ||A||2_j.]^ is of secondary importance. We would prefer an approximation for vmd 
in (2.4|, that is, a vector Vi, with ||t'*||2 = 1 such that ||At)i|| w ||A|j2_j.]^. We accomplish this 



goal via a randomized procedure that rounds an optimal solution Z^, to ( |2.7[ ) back to a vector 
v^,. The entire procedure is detailed in Algorithm [T] 

The first step of the algorithm solves the SDP relaxation (2.7). In Step[2][a), we draw a random 



y e {±1}" with E II AA*y||j = 2aJ/7r. This procedure is well understood [34,. The method in 
Step[2jb) that we use to compute v from y is novel, and it requires a proof of correctness, which 
appears in the Appendix. By choosing the best random outcome, Step [3] limits the probability 
that our method fails to provide a reasonable approximation. 
The following theorem describes the behavior of Algorithm [T] 

Theorem 2.2. Suppose that X is an n x p matrix, and let K be the number of rounding trials. 
Let {Vi,,ai,) be the output of Algorithm^ Then a^, > WXW^^^. Moreover, for 6 < 1, the 
inequality 

\\Xv4,>9^a, (2.9) 
holds except with probability e~'^^^^~^ )h ^ 



In Theorem 2.2 it may be more natural to specify a failure probability 5 > and approxi- 
mation loss e = 1 — 6* > instead of a repetition number K. In this case, simple algebra shows 
that IIAv^ll-^ > (1 — £)-y/2/7r|| A||2_j.;^ except with probability (5, so long as 

In particular, the choice K — 94 implies that ||At;^||-^ > 0.75|| A||2^]^ with probability at least 
0.999. 

We use the approximation ratio p = \\XVi,\\-^^/a-t: to measure the quality of the optimal solution 
in Sectionji] Although Theorem 2.2 only guarantees that we can make p as close to \/2/tt > 0.79 



as we desire, in practice we typically see a 0.95 approximation ratio or higher. This observation 
does not indicate that the analysis of the algorithm is loose; it follows directly from [31 Sec. 4.2] 
that this bound is asymptotically tight for a class of examples as n ^ oo. 

2.6. Implementation of Algorithrnjl] For a fixed iteration count K, the complexity of Algo- 
rithmjljis typically dominated by Step[lj When applied to (2.10), modern interior-point methods 



are guaranteed to compute the optimal objective value and optimal point Z^, accurately in 
polynomial time. The factor R^, is determined using a Cholesky factorization of Z-^. In practice, 
interior-point methods are very slow for large-scale problems, so we prefer an algorithm of Burer 
and Monteiro 

The algorithm of Burer and Montiero never forms the semidefinite matrix Z; rather it oper- 



ates directly with the factor R. We express the objective function of (2.10) in terms of R as 
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p. The constraints [Z]i 
1. 



1 are equivalent to constraints on the 



tTaceiRR*XX*) = \\X*R\\ 
rows of R of the form ||ri|j2 

We imphcitly enforce these row constraints by incorporating them into the objective function 
as in |4i Sec. 4.2]. The resulting unconstrained, nonconvex optimization problem takes the form 



maximize ||X*A/'(-R)||p, 
where A/'(-R) denotes the operator that normalizes the rows of R, that is, [N'{R)]ij 



(2.11) 



We then apply a conjugate gradient algorithm to maximize the unconstrained objective 



in (2.11). Our particular implementation uses the algorithm of Hager and Zhang j22j, which 



we have found to work well in our experiments. We refer to our online code for the choice of 
parameters in this conjugate gradient algorithm '31]. 



This factorization technique for solving (2.10) is advantageous because it reduces the dimen- 



sion of the problem. The paper [5] shows that restricting i? to be an n x A: matrix for k = 0{^/n) 
suffices to solve this problem exactly. To be precise, when k = [(1 + \/9 + 8n)/2j any local min- 
imum Ri, g 



pxfc q£ ( |2.ii[ ) gives a global minimum of ( |2.10[ ) via the map 
provided a mild technical conditioij^ holds. 

2.6.1. Orthogonal Restriction. Algorithm [T] only approximates the first principal component 
in (1.2). In order to approximate the fcth robust principal component for fc > 1, we define 



a new matrix Xk by restricting the rows of X to the subspace perpendicular to the span of 
Vi, . . . ,Vk-i. Ignoring numerical stability, we can inductively define 

Xk = Xk-i - Xvk-ivl_^^ X(l-Y!i2lv3'>^^), (2-12) 

which ensures each row of X is orthogonal to the previous components Vj for j < k. We then 
apply Algorithm 111 to the restricted matrix Xk to produce the component Vk- Since the output 

of Algorithm ^ is a linear combination of the rows of the input matrix by Step , this 
iterative procedure ensures that v^. is perpendicular to the previous components. 

In practice, the implementation can be done using Householder reflections as in [TT]; see [H] 
for further background on the implementation of Householder transformations. Householder 



reflections are more numerically stable than the naive method (2.12). Moreover, they take full 



advantage of the fact that we are only searching over ap— fc + 1 dimensional subspace by reducing 
the dimension of X^ to n x (p — k + 1). 

2.7. Extending the Rounding to Multiple Components. We have also attempted to ex- 
tract a collection of robust components simultaneously by solving a single semidefinite program. 
That is, we would like to solve the problem 



maximize I^Li ll^'^illi 
subject to {vi,Vj) = Sij 



(2.13) 



where Sij is the Kronecker delta function. When T = 1, equation (2.13) is equivalent with (1.2) 



When T > 1, the restriction {vi,Vj) = 6ij ensures that the optimum occurs at an orthogonal set 
of unit vectors. 

We can rephrase this optimization problem by the equivalent quadratically constrained qua- 
dratic program 

maximize X]"=i ''^i -^''^j 
subject to diag{w iW*) = 1, {vi,Vj) = Sij 

The diagonal restrictions on Wi ensure that Wi £ {±1}" for each i — 1, . . . ,n. The nonconvex 
problem (2.14) can be approximated via a semidefinite relaxation proposed in |33| . The results 



(2.14) 



of |41| imply that the optimal value of this relaxation is guaranteed to be larger than the optimal 
value of (2.13) by no more than a logarithmic factor. The rounding procedure does not produce 



orthogonal vectors, so we need to apply an additional orthogonalization step to achieve feasibility 



for (2.13). Empirically, we have found that the orthogonalization increases the objective value 



over the standard rounding, so it appears that there is no loss in applying this procedure. 



Specifically, the objective function tva,cc{Z X X*) must not be constant along a face of the feasible set. 
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Algorithm 2: Low-Leverage Decomposition 

Input: An n x p data matrix X; desired number of principal components T. 
Output: A p x T matrix with orthogonal columns. 

(1) Find (P^,C^) that solve 

minimize(p,c) Il^'ll2^2 + 7l|C'll2->oo .32) 
subject to P + C = X 

(2) Compute the SVD ^ UHV* . 

(3) Set V* to the first T columns of V, that is, set 

[V4,, = [V],, for z = 1, . . . ,p, and J = 1, . . . , T. 



Unfortunately, this method does not appear to be competitive with the projection pursuit 
method. The vectors we find by coupling Algorithm[l]with the orthogonal pursuit of Section [2.6.1| 



are feasible for (2.14) and typically provide a larger objective value than rounding coupled with 
post-processing orthogonalization. A better rounding procedure for this type of relaxation may 
prove more effective than the projection-pursuit approach; this is a direction for further research. 

3. The Low-Leverage Decomposition 

Our second method is derived from the interpretation of principal component analysis as 
a matrix approximation problem. When the observations are drawn from a highly correlated 
family, the singular values of the data matrix X tend to decay rapidly. If this is the case, then 
the matrix X is well approximated by a low-rank matrix P. 

It is rare that a large data set can be compiled without error, but it is often the case that 
the errors only affect a subset of the observations. We can model these errors through a multi- 
population model. Suppose that the bulk of the observations is well-explained by a low-rank 
model while the remainder come from another population or are corrupted by measurement noise. 
A prudent approach to robust principal component analysis would first separate the corrupted 
data from the uncorrupted data before attempting to recover a low-rank model. When the 
corrupted rows are unknown, this task may seem daunting. 

To accomplish this task, we propose a semidefinite program that decomposes the input X 
into two matrices: 

minimize(p,c') I!-P|l2^2 + tIICH*^^ 
subject to P + C = X. 

The norm ||i-*||2_^2 sum of the singular values of P and is known to promote low-rank 

solutions jl5j . while ||C'||2_j.oo is the sum of the £2 norms of the rows of C and promotes group 
sparsity [57] . 



We call the optimal matrix pair {Pi,, C^,) for the problem (3.1 ) the low-leverage decomposition 
(LLD) of X; we can interpret Ci, as an identified corruption and as a surrogate for the 
uncorrupted observations. We define our robust components as the right singular vectors of the 
surrogate matrix P^,. The detailed procedure appears in Algorithm [2j We show in Section [XT] 
that our recovered data matrix has the additional property of being a low-leverage set of 
observations. 

The LLD formulation is related to recent proposals [HI [7] , and we discuss this point more in 
Section 1121 



As we were preparing this manuscript, we became aware of the independent work |46|, 147) 



which also considers (3.1) for the robust PCA problem. This work shows that, under certain 
hypotheses, the recovered low-rank data P^, has the same row-space as the true data and the 
corrupted rows are correctly identified. 



3.1. Low-Leverage by Duality. In this section, we demonstrate that (3.1) extracts a low- 
leverage model for the data. This result follows from duality arguments that characterize the 
optimum of the convex program. 
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Lemma 3.1 (First-order optimality conditions for (3.1 1). A feasible pair {P,C) is optimal 



for (3.1) if and only if there exists a matrix Q such that 



2^2' 



(Q,C)=7||C||;_ 



\Q\ 
WQW 



2-i-2 



< 1 



< 



7, 



(3.3a) 
(3.3b) 



Proof. It follows from standard subdifFerential conditions that a feasible point (P, C) mini- 
mizes the functional in (3.11 if and only if zero is in the subgradient of f{P) = ||P||i^_j.2 + 
7||X — P||2^^. By the additivity of subgradients [38l Thm. 23.8], this condition holds if 
and only if there exists a matrix Q such that the subgradient conditions Q E d\\P\ 



2^2 



and 



-gea7||C|| 



are in force. 



We show that these subgradient conditions are equivalent to (3.3). By definition of the 
subdifferential, Q E 9||P||2^2 if only if for every perturbation A the subgradient inequality 



(Q,A) < ||P + A| 



2^2 



2-S.2 



(3.4) 



holds. Suppose first that (3.3a) holds. Then, for all A, we have 
(Q,A) 



{Q,P 



PIT < 



IQIl 



where the inequality follows by the definition of dual norms. Since \\Q\\ < 1 by assumption, the 



subgradient inequality (3.4) must hold 



It remains to show that the subgradient inequality (3.4) implies (3.3a I. Taking A : 
in (3.4) gives (Q, P) < ||P||2^25 while A = — P gives the reverse inequality (Q,P) > ||P| 
Therefore the subgradient inequality (3.4) implies (Q,P) = l|P|| 



On the other hand, suppose that A 7^ satisfies (Q, A) 



l2-y2- 

IQII2- 



||A||2_>.2; such a matrix 

A must always exist in finite dimensions since suprema are attained in the trace definition of 



norms. Then the subgradient inequality (3.4 1 implies 

iiAii: < iip-i- Aii: 



\Q[ 



< IIAI 



where the second inequality follows by the triangle inequality. Since A 7^ 0, we have shown that 
the subgradient inequality implies ||(5||2^2 — ^- Hence Q E 9||P||2^2 equivalent to (3.3a 
The equivalence between Q E d'y\\C\\2^^ and relation (3.3b) follows analogously. 



□ 



Before continuing, we introduce another fact concerning the subgradient of unitarily invariant 
norms. Let P VEV* be the compact SVD of P. It follows from [44j that ( |3.3a[ ) implies 
Q = UV* + W, where, in particular, UV*W = 0. 

3.1.1. Leverage scores. The leverage score of the observation Xi corresponding to the ith row 
of X is given by the number where H ~ X(X* Xy X* is the orthoprojector onto the 

column space of X . We refer to H as the hat matrix in accord with common statistical practice. 
A large leverage score tends to indicate that the corresponding observation lies outside of the 
bulk of the data, although it does not necessarily indicate that the point is influential in linear 
regression. We refer to [3U Ch. 6] for further discussion of leverage scores. 

The following theorem shows that the leverage scores of our decomposition are bounded 
above by 7^, justifying the terminology low-leverage decomposition for the solution of the pro- 



gram (3.1 1 



Theorem 3.2. Suppose (P^,, C^) is an optimal point of the program (3.1). Then the diagonal 
elements of the hat matrix H = P.^{PlPi,y P* are bounded above by 7^ 

Proof. From the characterization of the subgradient of unitarily invariant norms [44j discussed 
above, we know that Q = UV* + W with UV*W* = 0. Thus, 

QQ* = UU* + WW* ^ L^L^* = H, 

where the last equality can be easily checked using the definition of H and the SVD of P*. 
Since the diagonal entries of a positive-semidefinite matrix are nonnegative, this relation implies 
[H]ii < [QQ*]ii. Recall that the £2 ^ oo op erator norm is the maximum £2 row norm of the 
matrix. Thus relation (3.3b I of Lemma 3.1 implies that [QQ*]ii < 7^^, which completes the 
proof. □ 
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We can view our proposal as a method of decomposing a data matrix X into a component 
with a (user-specified!) upper bound on the leverage plus an error term. Moreover, this result 
gives a statistical interpretation to the regularization parameter 7 in ( |3.1[ ). 

We note that while our program guarantees a low-leverage decomposition, an assumption of 
suitably small leverage is a technical hypothesis in other works, e.g., [6j eq. (1-2)]. 

The reader should be warned that this method does not necessarily produce a low-leverage 
solution if we use our program to identify outlying data and then "prune" the rows. That is. 



suppose {Pi,,Ci,) is an optimal point of (3.1) and = for row indices i & I. Then the 
corresponding matrix Xj — Pj does not necessarily have leverage scores bounded above by 7^. 

3.2. The Choice of 7. In this section, we study how the value of the regularization parameter 
7 affects the properties of the decomposition. 

We begin by showing that, when 7 > 1, the degenerate solution (Pi,,C*) = (^,0) mini- 



mizes (3.1). This claim follows by explicit construction. Let UTiV* be the compact SVD of 
X, and define Q = UV* . Clearly {Q,X) = ||X||2_j.2, so Q satisfies (3.3a) with Pi, = X. 



By construction, the maximum singular value of Q is bounded above by one. Equivalently, 
QQ* ^ I. This inequality implies [QQ*]ii < 1. Since the diagonal entries of QQ* are the 
squared row norms of Q, we have shown that |!Q||2_i.oo ^1^7- This bound demonstrates 



that Q satisfies (3.3b) with = 0, which certifies optiniality of this degenerate solution by 
Lemma 13.11 

We now show that the regularization parameter 7 gives an upper bound on the rank of the 
optimal Pj,. It is easy to show using the SVD of P^, that the trace of the hat matrix H defined 



above is equal the rank of P^. Since [H]ii < 7^ by Theorem 3.2 we must have 

rank(P*) = trace(jy) < 717^ (3.5) 

The rank is a positive integer, so 7 < l/\/n implies that the optimal P^ is trivial. Moreover, 
in order to get T meaningful components in Step [2] of Algorithm [2j we require rank(Pj,) > T. 
Thus, we can limit ourselves to situations where 7 € [y^T/n, 1]. 



Inequality (3.5) has implications for the numerical solution of ( |3.1[ ). As we discuss in Sec- 



tion 3.3 the bulk of the computation comes from computing an SVD at each iteration. When the 
solution of the optimization problem has low rank, the iterates also tend to have low rank. This 
allows us to save significant computational effort by computing partial singular decompositions 
at each step. A judicious choice of 7 can increase the performance of our algorithm immensely. 
We find that taking nj'^ w is a useful heuristic for achieving a rank-T optimal solution, so 
long as n'3> T^- 

On the other hand, typical statistical data does not show true low-rank behavior even when 
there are no outliers. Therefore, forcing the optimal decomposition to be low rank typically 
results in a dense corruption C^,. This effect may be mitigated somewhat by another formulation 



we discuss briefiy in Section 3.4 In practice we find that setting 7 somewhat less than y'p/n, 
say 7 = 0.8y/p/n, provides a very good low-rank model, but it does poorly in the context of 
outlier identification. We discuss specific parameter choices for our experiments in Section [5j 

3.3. Computing the Low-Leverage Decomposition. Although general-purpose semidefi- 



nite programming software such as CVX [121 H] can solve small instances of (3.1 ) efficiently, the 
interior-point methods they utilize may be unable to complete even a single iteration of a large- 
scale problem. This observation indicates that we need to use different methods for large-scale 
problems. 



To solve (3.1 ), we recommend an alternating direction augmented Lagrangian algorithm anal- 



ogous to the one used in [5]; see also |37]. The generic form of the method is known as the 



Augmented Lagrangian Method of Multipliers (ALMM). The augmented Lagrangian for (3.1) 
with dual variable Q is given by 

C,{P, C, Q) = \\P\\;^, + j\\C\\;^^ + (X-P-C,Q) + 1\\X-P-C\\l. (3.6) 

For an initial starting point P^, we alternately solve P '^^ = argmiiip /:^(P,C^Q'') and 
C'^^^ = argmiiif; £^(P'''"*'^, C, Q*"'). We then update the multiphcr by the feasibility gap 

Qk + l ^ Qk ^ _ pk + i _ ^fc + l^_ 
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The minimizations above have an exphcit form in terms of shrinkage operations [H] 

C''+^ = RowShrink (^X - P'' + (3.7a) 
p'^+i = SpecShrink (x - C''+^ + -Q^ , fi] , (3.7b) 



V M 
where RowShrink(A, v) soft-thresholds each row of A: 

RowShrink(:, v): — > diag([l - vl\\a,\\^] + ) ■ A, 

where [x]^ = max{a;,0}. Similarly SpecShrink(A, v) soft-thresholds the singular values of A 

SpecShrink(:, ly) : USV* ^ U [S - i^I]^ V* , (3.8) 

where the operator is applied element-wise. We initialize the algorithm with = and set 
the parameter /i = np/||X||2_^^. We stop the algorithm when the iterates are nearly feasible, 
that is, ||X - P'' - C*^!! < 10-^||X||p. 

The main computational difficulty when running this algorithm involves computing the spec- 
tral shrinkage operator. When the iterates P'^ are low rank, we can save significant compu- 
tational effort by performing only partial singular value decompositions |27j . We can leverage 
our analysis in Section |3.2| to ensure that the optimal Pi, is low rank. Since the algorithmic 
iterates tend to be low-rank in this case, we can significantly improve the performance of our 
algorithm by choosing 7 to limit the rank of the optimal solution. In practice, we have found 
that one should set the quantity 717^ somewhat larger than the desired rank of the solution, e.g., 
ri7^ « when we desire a rank-T solution. 

3.4. Extensions for a Noisy Model. We note that there is an obvious extension of the LLD 
when one wants to account for an additional of noise in the model. Suppose that in addition to 
gross corruptions of certain observations, we would also like to model small corruptions or noise 
that may be spread throughout the data. 

Instead of enforcing the equality X = P + C, we allow for some additional slack of the form 
\\X — P — C\\p < ?7, where 77 is an estimate for the noise level. That is, we solve the problem 

minimize H-PHz^a + 7||C'||2^^ 
subject to ||X - P - Clip < ?7 



When f] — 0, this is equivalent to our proposal (3.1) for the gross corruption model. Other loss 



functions are also possible. Note that the Frobenius norm remains invariant under a rotation on 



the right, which is a feature of (3.1 ) that we would like to preserve 



This formulation is also studied in the independent work ^46. ,47, . It is shown there that under 



some technical conditions, the decomposition from (3.9) results in a decomposition where is 
close to a matrix with the same row-space as the true observations, and the matrix d, is close 
to a matrix that correctly identifies the column support of the corruption. 



4. Previous Work 

This section describes previous work on robust formulations of principal component analysis. 
Convex approaches to robust PCA are unusual, and, as a consequence, many other attempts 
at robust PCA lack rigorous algorithms. Often, proposals are put forward with a mathemati- 
cal formulation and only a heuristic algorithm — or an algorithm without a clear mathematical 
formulation. 

In Sections |4.I| and |4.2[ we describe the two methods in the literature most closely related 
to our proposals. We then describe in detail an approach for robust PCA recommended by 
Maronna |29j with which we provide comparisons in Section [s] We conclude with a short 
overview of other robust PCA proposals that have appeared in the literature. 
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4.1. Antecedents for MDR: Projection Pursuit PCA. Our MDR proposal is a particular 
instance of an approach that has come to be known as projection-pursuit PCA (PP-PCA), as 



we discuss in Section 2.2 The theoretical properties of PP-PCA are well understood; see for 
instance [12] and [TT] . 

All of the algorithms we have found in the literature for computing PP-PCA are meant to 
operate with an arbitrary scale. In view of the fact that the PP-PCA problem is NP-hard, 
it is unsurprising that the literature appears to contain no PP-PCA algorithms with proofs of 
correctness and tractability. Indeed, we have been unable to find other work that recognizes 
that the PP-PCA problem is intractable in a rigorous sense. 

The original study of Li and Chen |26j uses a Monte Carlo approach that was found to be 
computationally expensive. In theory, even simple Monte Carlo methods (e.g., randomly sam- 



pling the unit sphere) can produce arbitrarily good solutions to problem (2.2) with an arbitrary 
(continuous) scale. Given the computational hardness of the problem, it is unlikely that Monte 
Carlo approaches can provide guarantees of computational efficiency. 

Current algorithms for PP-PCA rely on heuristics. A popular and fast algorithm for generic 
projection-pursuit PCA is the finite direction method (FDM) of Croux and Ruiz-Gazen [TT] . 
This technique replaces the search over the entire unit sphere ||t'||2 = 1 with a finite search over 
the directions that appear among the observations: v G {a;i/|ja;i|j2, . . . , £c„/|ja;„||2}. The hope 
is that directions of large scale are likely to be well approximated by directions appearing in the 
data. This heuristic to performs poorly when n and p are large because it takes an extremely 
large number of points to cover a high-dimensional sphere. 

4.2. A convex approach. Recently, a method of Chandrasekaran et al. [J has been adapted 
for robust PCA in [6 . This approach attempts to decompose the data matrix into a sum of a 
low-rank matrix and a sparse matrix via the semidefinite program 

minimize H^Hz^a + ^ll-^lli^oo 
subject to L + S = X . 

The nuclear norm ||-||2_>2 promotes low rank and the matrix £i norm promotes sparsity. 

We refer to this method as N-I-Ll. The works [6, 7J provide conditions under which N -I- LI 
succeeds in exactly recovering a low-rank and sparse component. 

This convex approach is principled in the sense that the mathematical formulation is also 
algorithmically tractable. On the other hand, it lacks an invariance to a change in the observation 
basis possessed by all other methods we discuss, including standard PCA. That is, applying a 
rotation U*U = I to the data X = XU does not result in a similar rotation of the decomposition 
due to the fact that the norm ||-|| 2-5.00 ^'-'^ invariant under this transformation. 

One may argue that this invariance is inconsequential; in real data, the particular choice of 
coordinates has a meaning and outliers may occur coordinate-wise. This argument is defensible 
in domain specific examples, such as image data that contain specularities Nevertheless, PCA 
is intended to locate a coordinate basis that explains data more effectively than the standard 
basis 23 . If this is the analytical goal, basis invariance is indeed a requisite property. See 
Section |5.1.2| for an experiment where this lack of orthogonal invariance in N -t- LI appears to 
produce unnerving results. 

4.3. Spherical PCA. Another approach, known as spherical principal components (sphPCA) [5B 
rescales the observations to unit (Euclidean) norm and applies standard PCA to this modified 
data. To implement the sphPCA method, we first compute a normalized matrix X. Each 
row of X is the normalized version of the corresponding row of the centered data matrix X, 



that is Xi = Xi/\\xi\\2. Using the row-normalization operator from (2.11), we can express the 
normalized matrix as X = Af{X). 

The robust components are then defined as the standard principal components of the rescaled 
matrix X. Since all of the observations from the normalized matrix X have norm one, there are 
no large magnitude observations that exert an undue influence on the principal components. 

A study by Maronna [29] shows that sphPCA enjoys good practical performance. The ease 
of implementation and relatively good behavior of sphPCA leads Maronna to suggest it as the 
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default choice for robust principal component analysis. As a result, we use sphPCA as a baseline 
comparison for the performance of our robust methods in Section [5j 

4.4. Other proposals. Some of the earliest methods for robust PCA compute approxima- 
tions of correlation or covariance matrices using robust methods. Gnanadesikan and Kettenring 
propose direct robust estimation of the covariance matrices through robust estimation of the 
individual entries [T^- This may lead to counterintuitive results such as non-positive covari- 
ance matrices. An alternative approach explicitly enforces positive matrices as minimizers of a 
functional such as an M-estimator [T3]; see also the more recent study [TU] . 

A representative example of robust PCA from the machine learning community is the work 
of De La Torre and Black jl3j . They define the robust components as the minimum of a highly 
non-convex energy function and attempt to minimize this energy function using an iteratively 
reweighted least-squares algorithm coupled with an annealing step. No theoretical guarantees of 
correctness for the algorithm are provided. 

Another recent approach appears in the paper |45j of Xu et al. This algorithm randomly 
removes observations that appear to have high influence in the current estimate of the principal 
components. The principal component estimate is computed from the trimmed data. Xu et al. 
are able to establish strong theoretical properties of their algorithm, including a high breakdown 
point in the high-dimensional scaling regime where n — cx) and n/p — > c > 0. 

5. Numerical Experiments 

This section provides some numerical examples comparing our proposals with standard PCA 
and other robust PCA methods in the literature. In Section [O] we look at the projection of two 



data sets on the top robust component. Section 5.2 repeats a multiple-component experiment of 



Maronna [30 with additional robust methods. Section 5.3 contains a larger experiment, where 
we calculate the first two components of a dense matrix with more than twenty million entries. 

All of these experiments and algorithms are implemented with Matlab. Following the prin- 
ciple of reproducible research , we provide code that reproduces the exact experiments in this 
work pTj . 

5.1. Projection onto the top component. In this section, we study the robust component 
methods applied to two data sets. The first set is a selection of environmental factors that 
may affect the concentration of nitrogen dioxide around Oslo, Norway. The second example is 
constructed from standard iris data. In each case, we examine the spread of the data in the 
direction of the top robust component. 

5.1.1. Experimental setup. For these experiments, we center the data by removing the Euclidean 
median from each observation. The Euclidean median /i is a robust estimate of the center of 
the data, and is defined as 

En 
\\x,-fi\\^. (5.1) 
fi 

Maronna [301 Ch. 9 ] gives a method to solve this convex problem for fl. 

We project the data onto the top component for each method and compare the performance 
of the methods by the interquartile range (IQR), that is, the distance between the 25th and 75th 
percentile of the projected data. 

We apply extract the dominant component from each data set using our methods (MDRand 
LLD), other robust methods (sphPCAand N + LI), and standard PCA. For MDR, we use K ^ 94 



in |6 l, we 

iris data in Section [5. 1.3 we find that A = l/^/n gives a trivial result: no outliers were identified 
by N + LI. Instead, we use the more favorable choice A — 0.3/ y/n. 

5.1.2. Norwegian nitrogen dioxide data. Our data for this experiment consists of 500 observations 
of eight environmental factors around Oslo, Norway, available on the Statlib archive pj. The 
variables include the log-concentration of nitrogen dioxide (NO2) particles, the number of cars 
per hour, and the wind speed, as well as several additional factors useful for predicting the 
concentration of NO2 particles. 



rounding trials as discussed in Section 2.4 We set the LLD weight parameter 7 = O.Sy'p/n. As 
recommended in |6j, we set the N -I- LI parameter A = l/\/n for the first experiment. With the 
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Figure 1. Projection of the Oslo NO2 data set onto top components. The 
box surrounds the middle 50% of the data. The vertical line in the box is the 
median of the data. Each whisker extends either 1.5 times the length of the IQR 
or to the extreme value of the data, and the red crosses beyond the whiskers 
are the outlying points. The plots are ordered by decreasing IQR. 

We calculate the top component of the data using each method. In Figure [T] we plot the 
projection of the data onto the direction of these components using a standard box-and-whisker 
plot. The whiskers extend either 1.5 times the IQR beyond the edge of the box or to the extreme 
data point. We consider points that lie beyond the whiskers outliers. We give the percentage of 
outliers and several order statistics of the data in Table [21 

Every robust method results in a larger IQR than PCA. The MDR component finds the 
largest IQR, and the LLD method finds the smallest IQR among the robust methods. Except 
for N + LI, every method identifies a direction with a relatively large number of outliers, which 
indicates that the data has heavy tails. 

The N + LI method is unique because it does not identify a direction of large spread outside 
of the middle 50% of the data. We have observed that a random change of the observation 
basis causes the N + LI component to perform similarly to the LLD component. By orthogonal 
equivariance, the results for methods other than N + LI are unchanged by a change in the 
observation basis. This indicates that the behavior of the results given by the N + LI component 
is due to the lack of orthogonal equivariance. 

We note that the approximation ratio for the top MDR component is near optimal at 0.978. 

Table 2. Statistics for the projected NO 2 data. The last column lists the 
percentage of points lying outside the whiskers in Figure [l] 



Method 


IQR 


min 


25th 


75th 


max 


out 


MDR 


2.57 


-9.07 


-1.53 


1.05 


10.82 


5.00% 


sphPCA 


2.53 


-9.06 


-1.45 


1.08 


10.71 


5.60% 


N+Ll 


2.38 


-4.58 


-1.34 


1.05 


2.79 


0.00% 


LLD 


2.27 


-9.29 


-1.27 


1.00 


11.24 


7.40% 


PCA 


1.89 


-9.51 


-1.08 


0.81 


12.18 


11.00% 



5.1.3. Iris data. We use Fisher's iris data |T6] in this experiment. The data contains 60 obser- 
vations from three different species of iris: Iris setosa, Iris virginica, and Iris versicolor. Each 
observation consists of four measurements, namely sepal length, sepal width, petal length, and 
petal width. 

Fifty of the observations come from the setosa fiowers. We corrupt these observations with 
5 measurements of Iris virginica and five measurements of Iris versicolor. We hope that ro- 
bust principal components identify a direction of large spread in the setosa bulk of the data. 
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As a baseline comparison, we also calculate the dominant principal component of the setosa 
population without the outlying flowers. 

As in Section 5.1. 2[ we project the data onto the direction of the dominant components. 
These points are plotted in Figure [2j we distinguish the bulk setosa points from the versicolor 
and virginica observations. We compute an approximate density of the setosa observations by 
convolving the projected data with a unit volume Gaussian kernel of width a — 0.2. Table [3] 
gives some order statistics of the projections. 



Table 3. Order statistics for the projection of the setosa data onto the top 
components. The last column lists the number of setosa points further than 1.5 
times IQR left of the 25th percentile or the right of the 75th percentile. 



Method 


IQR 


min 


25th 


75th 


max 


out 


LLD 


0.70 


-1.21 


-0.41 


0.29 


1.14 


0.00% 


Setosa PCA 


0.70 


-1.22 


-0.41 


0.29 


1.14 


0.00% 


sphPCA 


0.69 


-1.19 


-0.41 


0.28 


1.13 


0.00% 


N+Ll 


0.66 


-1.16 


-0.40 


0.26 


1.07 


0.00% 


MDR 


0.37 


-0.79 


-0.24 


0.13 


0.53 


0.00% 


PCA 


0.19 


-0.60 


-0.15 


0.04 


0.37 


6.00% 



The dominant component of LLD, sphPCA, and N + LI each achieves an IQR at least 3 
times that of PCA. These components do not clearly distinguish among the three populations, 
indicating that these methods are insensitive to the effect of the outliers. LLD and sphPCA 
appear the most effective in this situation; indeed, it appears that LLD and sphPCA perform as 
well as setosa-only PCA. 

Although MDR results in the most modest IQR in the setosa among the robust methods, the 
IQR associated with the MDR component is 1.95 times the IQR of the setosa family along the 
dominant PCA component. Unlike the other robust methods, the MDR component discriminates 
among the three distinct populations. While it is clear that MDR does not reject the influence of 
the outliers, MDR balances the influence of outliers and the bulk of the data better than PCA. 
In this experiment the optimality ratio for MDR is 0.9975, certifying that the MDR component 
is essentially the direction of maximum mean deviation in the data. 

5.2. Regression Surface for Bus Data. In this experiment, we construct a regression surface 
using multiple components. A point is well described by a surface if its Euclidean distance 
from the surface is small. The dominant T classical principal components span a T-dimensional 
regression surface such that the sum of the squared distances of the observations to the plane is 
minimized. We would hope that robust components describe the bulk of the points better than 
standard components when outliers contaminate the data. We illustrate this behavior with an 
experiment of Maronna et al. |301 p. 214], which we augment with additional robust methods. 

5.2.1. Experimental setup. Our data consists ofp — 18 geometric features collected from n — 218 
bus silhouettes [40 that we arrange into an nxp matrix X . Following Maronna et al., we remove 
the 9th variable from the data and divide the columns of X by their median absolute deviation 
(MADN) , a robust measure of scale defined as 

MADN(a;) = median(|a; — median(a;)|). 

We then center the observations by their Euclidean median. We compute the top three compo- 
nents using PCA, MDR, LLD, sphPCA, and N + LI. We take the LLD parameter 7 = 0.8 \/n/m, 
the N + LI parameter A = -^/l/m, and the rounding count of MDR K — 94. 

For each method, we determine the Euclidean distance from each observation to the orthogonal 
regression plane spanned by the dominant three components. In Figure [3j we plot the ordered 
distances to the robust hyperplanes against the ordered distances to the PCA hyperplane. 

Since the PCA regression surface minimizes the sum of squared distances to the observations, 
not all of the observations can lie below the 1:1 line. However, a large number of points below 
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Figure 2. The projections of the iris data onto the top components. The 
points are randomly jittered above the zero hne for readabihty. The blue curve 
represents the approximate local point density of setosa. Note that LLD and 
sphPCA essentially provide the same projection as PCA without outliers. We 
sort the plots by decreasing IQR. 
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Figure 3. The distance of points to robust regression surfaces as a function of 
the distance of points to the standard PCA regression surface. The regression 
surface is deterrained by the top three components from each method. Points 
to the left of the median follow the same generic pattern as points in the 3rd 
quartile, and are therefore omitted. Three extreme points to the right are also 
omitted. 



the 1:1 line indicates that a robust regression surface explains the bulk of the data better than 
the classical surface. 

5.2.2. Discussion. Figure |3] focuses on the third and fourth quantiles of the data; the first and 
second quantiles roughly follow the pattern apparent in the third quantile. For clarity, we omit 
the three most outlying points that would appear in the upper right corner of the figure. Each 
robust method results in a regression surfaces that explains the data better than PCA for more 
than 75% of the points. In the third quantile, both N + LI and sphPCA lose their explanatory 
advantage over PCA. It is not until the after 95% of the data that MDR and LLD provide worse 
explanations than PCA. LLD is the dominating method through the latter part of the data. 

MDR explains the bulk of the data less effectively than the other robust methods, yet the final 
outlying observations are explained better by MDR than the other methods. This indicates that 
MDR is more sensitive to outlying points than the other robust methods, but is less sensitive 
to outliers than standard PCA. The optimality ratios for the first three MDR components are, 
respectively, 0.99999, 0.99992, and 0.97253, implying that MDR essentially succeeds in PP-PCA 
with the MD scale for this data. 

Finally, we note that changing the N + LI parameter to A — l^fxjn results in performance 
similar to LLD. 

5.3. Movielens. We finish this section with a larger example: the million-rating movielens 
data [21]. The data consist of 6040 users rating and 3952 movies, though several movies are 
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Table 4. Most important movies as given by the first standard principal com- 
ponent. The decimal numbers represent the weight each component puts on a 
movie. The integer to the right of the weight is the rank of the movie under the 
given component. 



Movie 


PGA 


MDR 


SPH 


LLD 


GoodFellas 


0.0708 


1 


0.1016 


1 


0.1092 


1 


0.0885 


1 


Army of Darkness 


0.0697 


2 


0.0914 


3 


0.0970 


4 


0.0832 


2 


A Little Princess 


0.0664 


3 


0.0899 


4 


0.1028 


2 


0.0826 


3 


Pushing Hands 


0.0657 


4 


0.0772 


11 


0.0827 


10 


0.0745 


8 


Stand by Me 


0.0656 


5 


0.0853 


5 


0.0896 


5 


0.0765 


5 



Table 5. Most important movies: second component. 



Movie 


PGA 


MDR 


SPH 


LLD 


Nikita 


0.0982 


1 


0.1099 


2 


0.0959 


13 


0.1071 


5 


Citizen Kane 


0.0945 


2 


0.1051 


4 


0.0935 


15 


0.1104 


3 


Fried Green Tomatoes 


0.0917 


3 


0.0934 


8 


0.0727 


35 


0.0944 


11 


Unforgiven 


0.0891 


4 


0.0923 


10 


0.0877 


21 


0.0982 


9 


Mommie Dearest 


-0.0855 


5 


-0.1108 


1 


-0.1522 


1 


-0.1281 


1 



replicated. The set contains just over one million ratings. Each rating is between one and five 
stars, and each user in the data set rated at least 20 movies. 

We arrange these responses into an n = 6040 by p = 3952 matrix X whose rows correspond 
to the users and whose columns correspond to the movies. We set unrated movies to the user's 
median rating, and center each user's ratings by their personal median. As with our other 
experiments, we center the rows by the Euclidean median, which results in a dense matrix with 
nearly 24 million entries. 

We then compute the top two components using PGA, MDR, LLD, and sp hPG A. In order 

v/iooT^. 



3.3 



this choice 



to speed up processing for LLD, we set 7 = ^100/p. As discussed in Section 
of 7 limits the rank of the iterates P*^''^ in the ALMM algorithm, which allows us to compute a 
partial SVD at each step. Our choice 7 = -^/lOO/n results in iterates whose rank is roughly 10; 
the rank of the optimal point P^, is nine. 

Each component v represents a direction in movie coordinates. The magnitude entry [v]i 
indicates how much v points in the direction of movie i. We use these magnitude of the entries 
in the components to rank the movies. We call movies with large magnitudes "important," and 
we call the corresponding entry of the component a movie's "importance." 

5.3.1. Discussion. Table [4] displays the five most important movies identified by the first stan- 
dard principal component, along with the importance and rank calculated assigned to these 
movies by the robust components. Each method agrees that the violent mobster movie GoodFel- 
las is the most important film. Indeed, GoodFellas, Army of Darkness, A Little Princess, and 
Stand by Me are ranked in the top five movies by every method. However, PGA ranks Pushing 
Hands much higher than the robust methods. 

In Table |4j each importance has positive sign. For each method, the first component assigns 
very few movies a negative importance for the first component. This fact comes about because 
the typical user rating is positive; that is, the sum X^j l^jlj greater than zero for most users. 

Table [5] displays the results for the second components. Each robust component views Mom- 
mie Dearest as the most important movie, while standard PGA relegates it to fifth place. Neither 
Fried Green Tomatoes nor Unforgiven are among the top five movies for the robust methods. 
With the second component, sphPGA takes the most dramatic shift away from PGA, with only 
Mommie Dearest making it into the top ten movies. 
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Of course, rankings are not the whole story. The signs are very consistenij^ between methods. 
Mommie Dearest is negative for every method considered and Fried Green Tomatoes is positive. 
The sign consistency indicates that these components are measuring essentiaUy the same thing. 

The magnitude of the importance are also telling. PCA assigns the smallest weight to every 
movie, with the exception of the second component of sphPCA. This indicates that the robust 
methods are willing to assign more importance to discriminating movies. 
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Appendix A. Proof of Theorem 12.21 

This appendix contains the proof of Theorem |2.2| that we repeat below as Theorem |A.4| We 
begin with some supporting results. The following result of Alon and Naor Sec. 4.2] allows 
us to bound the expectation of below. The essence of this result goes back to a 1953 

paper of Grothendieck [20]; see also the little Grothendieck theorem in [36l Sec. 5b]. 

Lemma A.l. Let be the value of the optimization problem (2.101 of Algorithm^ Then > 
Moreover, let y^*''-' be one of the vectors generated in Stepl^ Then E || [[^ > 

The claim > ||XX* also follows from our discussion of the SDP relaxation in Sec- 

tion [2Al] We also need the following proposition. 

Proposition A. 2. For each matrix X , the identity \\XX*\\^^^ = j|X||2^^ holds. 
Proof. We can express 

ll^^*lloo^i-„max {X*w,X*y). 

\\w\\ ^ — 1 

ll2/lloo=l 

By the conditions for equality in the Cauchy-Schwarz inequality, it follows that we can take 
w = y above. Hence 

ll^^1loo^l = ll^1lL^2 = ll^lll^2- 

where the last equality is a standard fact concerning adjoint operators. □ 

We use the following variant of the Paley-Zygmund integral inequality (35) to bound the 
probability that || is less than its expectation. 

Lemma A. 3. Suppose Z is a random variable such that < Z < C for some C > 0. Then, for 
any scalar 9 G [0, 1], we have V{Z > 0E[Z]) > C'^il - e)E[Z]. 

Proof. Split the integral E[Z] into two integrals, the first over the region Z < OE[Z] and the 
second over the region Z > 9E[Z]. Notice that the former integral is bounded above by 6'E[Z], 
while the latter integral is bounded above by CP{Z > 6'E[Z]). Simple algebraic manipulation 
then shows the claim. □ 

We now restate and prove the main Theorem of Section [2j 

Theorem A. 4. Suppose that X is an nx p matrix, and let K he the number of rounding trials. 
Let {Vi,,ai,) he the output of Algorithm^ Then > ||X||2_j.]^. Moreover, for 9 € [0,1], the 
inequality 

\\Xv4^ > O^a^ 
holds except with probability )h ^ 



•^Since components are only defined up to a sign, we mean that the sign pattern in Tables [4] and [5] are equivalent 
modulo multiplication by —1. 
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Proof. Let y G {±1}" be a sign vector and define v = X*y/\\X*y\\.2. Then 
\\Xv\\,^\\X*y\\-' max {w,XX*y) > \\X*y\\, 

where the inequaUty follows by taking the specific choice w ^ y. In particular, this relation 
implies that the vectors t;'^'''' — X*y^'^^/||X*y('^^ [[^ generated in Step [2] of Algorithm [I] satisfy 

EllXv^'^Mf^ > E||X*y('^')||' > -al, (A.l) 
II Hi II 112 ^ 

where the last inequality follows from the second claim in Lemma |A.l| 

Since Hf'-'^-' II2 = 1, the quantity is a pos itive random variable bounded above by 



-X^ll2-i.i- Therefore, inequality (A.l I and Lemma A. 3 imply that 



|2 ,9 2al\ ^ 2 / a* \\ 2 



F(\\Xv^%>9^--^'j>{l-9^)^-^^^^^J >^.{l-0% (A.2) 



where we have used the fact that a* > ||X||2^]^ by Proposition A.2 and the first claim of 
Lemma lA.ll 



In Step[3|of the algorithm we have chosen v^, to maximize , so the inequality || < 

2(1 — 9'^)/n holds if and only if IIXi)^*''^ || 1 < 2(1 — 6'^)/7r for aU k. Therefore, the independence 
of v^''^ tor k = 1, . . . ,K implies 



K 



vr 

which completes the claim. □ 

References 

[1] M. Aldrin. N02.dat. |http : // lib ■ stat . emu. edu/datasets/H02 ■ dat| 2004. 

[2] N. Alon and A. Naor. Approximating the Cut-Norm via Grothendieck's Inequality. SIAM J. Comput., 
35(4):787, 2006. 

[3] J. Buckheit and D. Donoho. Wavelab and reproducible research, 1995. 

[4] S. Burer and R. D. C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via 

low-rank factorization. Math. Program., 95(2):329-357, 2003. 
[5] S. Burer and R. D. C. Monteiro. Local Minima and Convergence in Low-Rank Semidefinite Programming. 

Math. Program., 103(3) :427-444, Dec. 2004. 
[6] E. J. Candes, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? preprint, Dec. 2009. 

arXiv:0912.3599. 

[7] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky. Rank-Sparsity Incoherence for Matrix 

Decomposition, preprint, June 2009. arXiv:0906.2220. 
[8] P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling 

and Simulation, 4(4):1168-1200, 2006. 
[9] C. Croux, P. Filzmoser, and M. R. Oliveira. Algorithms for projection-pursuit robust principal component 
analysis. Chemom. Intell. Lab. Syst., 87:218-225, 2007. 
[10] C. Croux and H. Haesbroeck. Principal component analysis based on robust estimators of the covariance or 

correlation matrix: influence functions and efficiencies. Biometrika, 87(3):603-618, Sept. 2000. 
[11] C. Croux and A. Ruiz-Gazen. High breakdown estimators for principal components: the projection-pursuit 

approach revisited. J. Multivariate Anal, 95(l):206-226, 2005. 
[12] H. Cui. Asymptotic distributions of principal components based on robust dispersions. Biometrika, 90(4) :953- 
966, Dec. 2003. 

[13] F. De La Torre and M. Black. A framework for robust subspace learning. Int. J. Comput. Vision, 54(1):117- 
142, 2003. 

[14] S. Devlin, R. Gnandesikan, and J. Kettenring. Robust estimation of dispersion matrices and principal com- 
ponents. J. Am. Stat. Assoc., 76(374) :354-362, 1981. 

[15] M. Fazel. Matrix rank minimization with applications. Dissertation, Stanford University, Stanford, CA, 2002. 

[16] R. A. Fischer. The use of multiple measurements in taxonomic problems. Ann. Eugenic, 7:179—188, 1936. 

[17] R. Gnanadesikan and J. R. Kettenring. Robust estimates, residuals, and outlier detection with multiresponse 
data. Biometrics, 28(1):81-124, 1972. 

[18] M. Grant and S. Boyd. Graph implementations for nonsmooth convex programs. In V. Blondel, S. Boyd, and 
H. Kimura, editors. Recent Advances in Learning and Control, Lecture Notes in Control and Information 
Sciences, pages 95-110. Springer- Verlag Limited, London, 2008. |http : //Stanford ■ edu/-boyd/graph_dcp ■ | 
Etmli 

[19] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 1.21. |http : | 
|//cvxr.com/cvx| Oct. 2010. 



TWO PROPOSALS FOR ROBUST PCA 



21 



A. Grothcndicck. Resume de la theorie metrique des produits tensoriels topologiques (French). Bol. Soc. 

Mat. So Paulo, 8:1-79, 1953. 

Grouplens Research. MovieLens Data Sets. |http: //www.grouplens . org/system/f lles/mllllon-ml-data.] 
|t£Lr . gz 

W. W. Hager and H. Zhang. Algorithm 851: CG_DESCENT, a conjugate gradient method with guaranteed 
descent. ACM Trans. Math. Software, 32(1):137, 2006. 

H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational 
Psychology, 24(6):417-441, 1933. 

P. J. Huber. Robust statistics. Wiley, Hoboken, New Jersey, first edition, 1981. 

P. J. Huber and E. Ronchetti. Robust statistics. Wiley, Hoboken, New Jersey, second edition, 2009. 

G. Li and Z. Chen. Projection-pursuit approach to robust dispersion matrices and principal components: 

primary theory and Monte Carlo. J. Am. Stat. Assoc., 80(391):759-766, 1985. 

Z. Lin, M. Chen, L. Wu, and Y. Ma. The augmented lagrange multiplier method for exact recovery of 
corrupted low-rank matrices. Math. Program., submitted, 2009. arXiv:1009.5055. 

N. Locantore, J. S. Marron, D. G. Simpson, N. Tripoli, J. T. Zhang, K. L. Cohen, G. Boente, R. Fraiman, 

B. Brumback, C. Croux, J. Fan, A. Kneip, J. I. Marden, D. Peiia, J. Prieto, J. O. Ramsay, M. J. Valderrama, 
and A. M. Aguilera. Robust principal component analysis for functional data. Test, 8(1);1— 73, June 1999. 
R. A. Maronna. Principal Components and Orthogonal Regression Based on Robust Scales. Technometrics, 
47(3):264-273, Aug. 2005. 

R. A. Maronna, D. R. Martin, and V. J. Yohai. Robust Statistics: Theory and Methods. Wiley Series in 
Probability and Statistics. Wiley, Hoboken, NJ, 2006. 

M. McCoy and J. A. Tropp. Online code, 2010. http://www.acin.caltech.edu/~niccoy/ 

D. C. Montgomery, E. A. Peck, and G. G. Vining. Introduction to Linear Regression Analysis. Wiley Series 
in Probability and Statistics. Wiley, Hoboken, NJ, 2006. 

A. Nemirovski. Sums of random symmetric matrices and quadratic optimization under orthogonality con- 
straints. Math. Program., 109:283-317, January 2007. 

Y. E. Nesterov. Semidefinite relaxation and nonconvex quadratic optimization. Optim. Methods Softw., 
9(1):141-160, 1998. 

R. E. A. C. Paley and A. Zygmund. A note on analytic functions in the unit circle. Math. Proc. Cambridge 
Philos. Soc, 28(03) :266-272, Oct. 1932. 

G. Pisier. Factorization of linear operators and geometry of Banach spaces. Regional Conference Series in 
Mathematics. American Mathematical Society, Providence, RI, 1986. 

B. D. Rao and K. Kreutz-Delgado. Sparse solutions to linear inverse problems with multiple measurement 
vectors. Proceedings of the 8th IEEE Digital Signal Processing Workshop, 1998. 

R. T. Rockafellar. Convex Analysis. Princeton Mathematical Series. Princeton University Press, 1970. 
J. Rohn. Computing the Norm is NP-hard. Linear and Multilinear Algebra, 47(3): 195-204, 2000. 

J. P. Siebert. Vehicle Recognition using Rule Based Methods, 1987. Turing Institute Research Memorandum 
TIRM-87-018. 

A. M.-C. So. Improved approximation bound for quadratic optimization problems with orthogonality con- 
straints. Symposium on Discrete Algorithms, 2009. 

J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Springer, New York, NY, 2002. 
J. W. Tukey. A Survey of Sampling from Contaminated Distributions. In I. Olkin, editor. Contributions to 
probability and statistics: essays in honor of Harold Hotelling, pages 448-474. Stanford University Press, 
Stanford, CA, 1960. 

G. Watson. Characterization of the Subdiffcrential of Some Matrix Norms. Linear Algebra Appl., 170:33-45, 
June 1992. 

H. Xu, C. Caramanis, and S. Mannor. Principal component analysis with contaminated data: The high 
dimensional case, preprint, pages 1-37, 2009. arXiv:1002.4658. 

H. Xu, C. Caramanis, and S. Sanghavi. Robust PCA via Outlier Pursuit, preprint, pages 1-24, 2010. 
arXiv:1010.4237. 

H. Xu, C. Caramanis, and S. Sanghavi. Robust pea via outlier pursuit. In J. Lafferty, C. K. I. Williams, 
J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, NIPS 23, pages 2496-2504. 2010. 



